Week 4: Data visualisation

Charlotte Hadley

Topics for today

  1. Why do we use charts to tell stories?

  2. Evidence-based visual perception theory

  3. Advice on choosing charts

  4. Advice on using colour in charts

  5. Using this advice to tell stories with charts built with {ggplot2}

Why do we use charts?

A picture is worth a thousand words

Data visualisations are demonstrably useful

There is considerable experimental evidence for data visualisations improving:

  • Comprehension of data

  • Decision making accuracy and confidence


Evidence has been collected using eye-tracking, survey filling and interviews.

For a good overview of the available research see Eberhard (2021) [1].

Some of these studies consider tables to be a type of data visualisation.

I agree with this! Tables are often awesome choices for presenting data - let’s talk more about this later today.

Data visualisations are demonstrably useful

In 1973 Anscombe [2] published a paper designed to demonstrate…

Graphs are essential to good statistical analysis.

To do so he constructed four datasets that share nearly identical summary statistics.

Data visualisations are demonstrably useful

However, once the datasets are visualised it is obvious that they are fundamentally different from one another.

These charts are now known as Anscombe’s quartet [2].
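The quartet ships with R as the built-in anscombe data frame, so the claim is easy to check. The sketch below is one possible way to do it: the object names (quartet, quartet_summary, quartet_chart) are made up for this example, and it only compares means and correlations rather than every property Anscombe matched.

```r
library(tidyverse)

# Reshape the built-in `anscombe` data frame (columns x1..x4 and y1..y4)
# into one row per observation, labelled by dataset
quartet <- map_dfr(1:4, function(i) {
  tibble(
    dataset = paste("Dataset", i),
    x = anscombe[[paste0("x", i)]],
    y = anscombe[[paste0("y", i)]]
  )
})

# The summary statistics are almost indistinguishable...
quartet_summary <- quartet %>%
  group_by(dataset) %>%
  summarise(
    mean_x = mean(x),
    mean_y = mean(y),
    correlation = cor(x, y)
  )
quartet_summary

# ...but the charts are clearly not
quartet_chart <- quartet %>%
  ggplot() +
  aes(x = x, y = y) +
  geom_point() +
  facet_wrap(~dataset)
quartet_chart
```

Printing quartet_summary and then quartet_chart side by side makes the point of the slide: the table alone would suggest four copies of the same dataset.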

Data visualisations are demonstrably useful

The “Datasaurus Dozen” is a modern reimagining of the original quartet [3].


Datasaurus was originally created by Alberto Cairo [4].


There’s now an R package for building your own metamers: eliocamp.github.io/metamer/



ALWAYS.

Always visualise your datasets.

Data visualisations are demonstrably useful

There are several historical visualisations that have fundamentally changed social policy and behaviour.


This is a map from John Snow in 1855 [5] that ties a cholera outbreak to a specific water pump.


Combined with Snow’s statistical analyses this was a significant step towards the development and acceptance of germ theory.

Data visualisations are demonstrably useful

Just a few years later, Florence Nightingale [6] was creating charts to demonstrate the importance of basic sanitation in military hospitals.


This specific chart type is very dramatic and quite rarely used. It’s a polar area diagram, also known as a Nightingale rose diagram.


But it’s important to acknowledge that Nightingale used many different types of charts in her work.

Her charts and analyses were central to bringing basic sanitation standards to nursing and hospitals.

Data visualisations are demonstrably useful

In 2006 Hans Rosling [7] gave an incredible TED talk where he introduced animated bubble charts as a tool to tell stories about global development.


These charts helped demonstrate the value of interactive and animated data visualisations - which is why Google bought the tool behind the charts!

Data visualisations are demonstrably useful

A more recent example of a very powerful data visualisation is the spiralling global temperature GIF from 2016 by Ed Hawkins [8].


We can create animated GIFs with {ggplot2} via the {gganimate} package. In fact, Pat Schloss [9] has a YouTube video and GitHub repo recreating this chart with R.

Evidence-based visual perception theory


There is a wealth of evidence-based research into how precisely and accurately readers perceive charts.


Source: Wikimedia.org

Our evidence comes from:

  • Eye tracking. We’re really good at measuring where the eye is looking, for how long and how intently.

  • Asking trial participants to estimate or compare values in charts.

There are open debates* on how our internal visual perception system works - what the brain is doing.

* A good example is pie charts: we’re still not sure what our brains are doing, but thanks to Robert Kosara [10] we know they’re not measuring area.

Elementary perceptual tasks

Back in 1984 Cleveland & McGill [11] published their seminal paper on graphical perception theory where they defined “elementary perceptual tasks”.


This study is the backbone of much of the research in this field.

Elementary perceptual tasks

Cleveland & McGill [11] designed many experiments where participants were asked to:

  • Identify the largest/smallest segment

  • Estimate what % the smaller segment was of the larger segment

The accuracy of subject estimates was then statistically analysed.

Crowd-sourced evidence for perception theory

Heer & Bostock [12] replicated this study using Amazon’s Mechanical Turk with 3,481 participants in 2010.


They validated the results of Cleveland & McGill [11] and provided further evidence that…

There is a hierarchy of elementary perceptual tasks - or chart elements - when accuracy matters.

Ordering channels of communication (by accuracy)

Images from Beecham et al. [13]

… real-world applications of visual perception theory (I)

Images from Robert Kosara [14]

… real-world applications of visual perception theory (II)

Images from Robert Kosara [14]

… real-world applications of visual perception theory (III)

Image found on Twitter from @irg_bio [15] - code for chart available from GitHub [16].

Why is someone measuring your chart?

To extract accurate values.

The magnitude of chart elements.


To quantitatively compare values.

The part-to-whole or relative magnitude of chart elements.


To find the largest/smallest value.

The ranking of chart elements.


To find unusual values.

The distribution, ranking or magnitude of chart elements.

Why is someone reading your chart?

You have a story you want to tell

There’s lots we can do to help guide the reader to understand your chart and follow the story you’re telling. We’ll cover some examples during this course.


The reader wants to see the data

Charts (and tables) are the best way to see the “big picture” of a dataset - a single value (e.g. the mean) tells the reader very little on its own. Interactivity is really useful to allow readers to properly explore the dataset.


The reader has a preconception about the data

Readers might approach a chart already biased towards a particular theory about the data. We can do our best to make our charts easy to read and avoid common pitfalls.

How do we choose a chart?

Use data columns to choose charts

Use your story to choose charts

data-to-viz.com

This site also provides simple-to-follow instructions for using {ggplot2} to build every chart type you can find on the website.

ft-interactive.github.io/visual-vocabulary

The Visual Vocabulary is a really useful tool for thinking about how to tell your story with a chart.

Lots of the dataviz at the FT is done with R. John Burn-Murdoch [17] is a great source to follow.

{ggplot2} for charts

ggplot2: A Grammar of Graphics

{ggplot2} is an incredibly powerful and flexible tool for building static dataviz.

We can build (almost*) any static chart we can conceive of.

* Dual y-axis charts are the exception: the secondary axis must be a transformation of the primary axis (for good reasons).

Building blocks of a {ggplot2} chart

Aesthetics

Geoms

Scales

Guides

Theme

Aesthetics

Aesthetics map columns in our datasets onto visual properties of our chart, such as position on the x and y axes and colour:


# Loads {ggplot2}, {dplyr} and the %>% pipe; msleep ships with {ggplot2}
library(tidyverse)

msleep %>% 
  ggplot() +
  aes(
    x = sleep_total,
    y = sleep_rem,
    colour = vore
  )


{ggplot2} uses tidy evaluation to allow us to use bare column names in our code.

Aesthetics

Where is aes() placed? | What it does
Inside ggplot() or on its own | Sets the aesthetics for the entire {ggplot2} object. These could be considered the coordinate system aes().
Inside geom_*() | Sets aesthetics for a specific geom within the existing coordinate system aes() for the {ggplot2} object. These should be considered geom-specific aes().
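To make the difference concrete, here is a small sketch (the object name aes_chart is just for illustration). The x and y aesthetics are set for the whole chart, while colour is set only inside geom_point(), so the trend line added by geom_smooth() does not get split by vore.

```r
library(tidyverse)

aes_chart <- msleep %>%
  ggplot() +
  # Chart-wide aesthetics: every geom below inherits x and y
  aes(x = sleep_total, y = sleep_rem) +
  # Geom-specific aesthetic: only the points are coloured by vore
  geom_point(aes(colour = vore)) +
  # This layer only sees x and y, so it draws a single trend line
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

aes_chart
```

Moving colour = vore into the chart-wide aes() would instead draw one trend line per vore group, which is a quick way to see which layers inherit what.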

Geoms

Geoms use the aesthetics to add layers to our charts.

msleep %>% 
  ggplot() +
  aes(
    x = sleep_total,
    y = sleep_rem,
    colour = vore
  ) +
  geom_point()


There are 50+ geoms baked into the {ggplot2} package.

geom_abline(), geom_area(), geom_bar(), geom_bin2d(), geom_blank(), geom_boxplot(), geom_col(), geom_contour(), geom_contour_filled(), geom_count(), geom_crossbar(), geom_curve(), geom_density(), geom_density_2d(), geom_density_2d_filled(), geom_density2d(), geom_density2d_filled(), geom_dotplot(), geom_errorbar(), geom_errorbarh(), geom_freqpoly(), geom_function(), geom_hex(), geom_histogram(), geom_hline(), geom_jitter(), geom_label(), geom_line(), geom_linerange(), geom_map(), geom_path(), geom_point(), geom_pointrange(), geom_polygon(), geom_qq(), geom_qq_line(), geom_quantile(), geom_raster(), geom_rect(), geom_ribbon(), geom_rug(), geom_segment(), geom_sf(), geom_sf_label(), geom_sf_text(), geom_smooth(), geom_spoke(), geom_step(), geom_text(), geom_tile(), geom_violin(), geom_vline()


As we’ll see later, there are many {ggplot2} extension packages that add even more geoms to the mix.

Some geoms are built from others (I)

geom_histogram() has clever tricks to make useful histograms

ggplot(quakes, aes(mag)) +
  geom_histogram()

It’s built by calling geom_bar()

ggplot(quakes, aes(mag)) +
  geom_bar() +
  scale_x_binned()

Some geoms are built from others (II)

But geom_bar() itself is built from geom_rect().

rect_data <- tribble(
  ~x_min, ~x_max, ~y_min, ~y_max,
  4, 4.48, 0, 60,
  4.5, 5.48, 0, 100,
  5.5, 5.98, 0, 10
)
rect_data %>% 
  ggplot() +
  geom_rect(aes(xmin = x_min, 
                xmax = x_max, 
                ymin = y_min, 
                ymax = y_max)) +
  theme_gray(base_size = 25)


There are 8 primitives from which all other geoms are built:

geom_blank(), geom_path(), geom_point(), geom_polygon(), geom_rect(), geom_ribbon(), geom_segment(), geom_text()

Most geoms have x and y aesthetics

These tell the geom where it needs to be drawn:

starwars %>% 
  ggplot() +
  aes(x = height,
      y = mass) +
  geom_point()

Some geoms need more than just x and y

Let’s use geom_segment() to visualise some of the periods of the dinosaurs:

dinosaurs <- tribble(
  ~period, ~start, ~end,
  "Triassic Period", -251e6, -225e6,
  "Late Triassic Period", -225e6, -200e6,
  "Jurassic Period", -200e6, -150e6,
  "Late Jurassic Period", -150e6, -145e6
)

To build this chart we need to specify all of the following: x, xend, y and yend.

Use size to affect geom size

In many charts we want geoms to be thicker, bigger or just be more prominent.

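Timelines (or Gantt charts) are good examples of this. We want the segments to be thicker to improve the readability of the chart - this comes down to the size aesthetic. Note that from {ggplot2} 3.4.0 onwards the thickness of line-based geoms is set with linewidth, and using size for lines triggers a deprecation warning.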

dinosaurs %>% 
  ggplot() +
  aes(x = start, xend = end,
      y = period, yend = period) +
  geom_segment(size = 30)

Out of order dinosaurs

This is still a bad chart.

The periods are not ordered in geological time. Instead, they’re ordered (reverse) alphabetically.

To control the order of things in {ggplot2} charts we must use factors - which are picked up by the scales.
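As a rough sketch of what that looks like for the dinosaurs chart, we could convert period into a factor whose levels follow the start column. This uses fct_reorder() from {forcats} (loaded by the tidyverse); the names dinosaurs_ordered and ordered_chart are made up for this example, and linewidth assumes {ggplot2} 3.4.0 or later.

```r
library(tidyverse)

dinosaurs <- tribble(
  ~period, ~start, ~end,
  "Triassic Period", -251e6, -225e6,
  "Late Triassic Period", -225e6, -200e6,
  "Jurassic Period", -200e6, -150e6,
  "Late Jurassic Period", -150e6, -145e6
)

# Turn period into a factor with levels ordered by start date (oldest first)
dinosaurs_ordered <- dinosaurs %>%
  mutate(period = fct_reorder(period, start))

levels(dinosaurs_ordered$period)

# The discrete y scale now follows the factor levels,
# with the first level (the oldest period) at the bottom
ordered_chart <- dinosaurs_ordered %>%
  ggplot() +
  aes(x = start, xend = end,
      y = period, yend = period) +
  geom_segment(linewidth = 10)  # assumes ggplot2 >= 3.4.0

ordered_chart
```

If the oldest period should sit at the top instead, wrapping the factor in fct_rev() is one option.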

Some geoms are designed to save time

geom_bar() defaults to counting instances of a variable.

# geom_bar() does the counting itself, so we pass the raw data
mpg %>% 
  ggplot() +
  geom_bar(aes(manufacturer))

geom_col() uses a column to dictate the length of bars.

mpg %>% 
  count(manufacturer) %>% 
  ggplot() +
  geom_col(aes(x = manufacturer, y = n))

Some geoms depend on stat functions

The geom_bar() function has a stat argument with the default value of "count".

We can force the geom to behave like geom_col() by changing the stat:

mpg %>% 
  count(manufacturer) %>% 
  ggplot() +
  geom_bar(aes(x = manufacturer,
               y = n),
           stat = "identity")


All of the goodness from the stat argument comes from the stat_identity() and stat_count() functions.

If you’re building a complex chart it might be useful to directly call a stat_*() function.
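For instance, a minimal sketch of calling stat_count() directly and asking it to draw points instead of bars might look like this (count_points is just an illustrative name). The stat does the counting and the geom argument decides how the counts are drawn.

```r
library(tidyverse)

count_points <- mpg %>%
  ggplot() +
  # stat_count() counts rows per manufacturer;
  # geom = "point" draws each count as a point rather than a bar
  stat_count(aes(x = manufacturer), geom = "point")

count_points

# layer_data() shows the values the stat computed for the layer
layer_data(count_points) %>%
  select(x, count)
```

layer_data() is a handy way to inspect what a stat has computed before deciding how to display it.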

Position things to resolve overlapping (I)

Box and whisker diagrams hide a lot of detail.

# The bechdel dataset comes from the {fivethirtyeight} package
library(fivethirtyeight)

bechdel %>% 
  filter(complete.cases(.),
         domgross_2013 < 0.5e9) %>% 
  ggplot(aes(clean_test, 
             domgross_2013)) +
  geom_boxplot() +
  theme_gray(base_size = 25)

Let’s add the data points to this chart with geom_point() and look at the position argument.
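Here is a minimal sketch of that idea. To keep it runnable without extra packages it swaps in the built-in mpg dataset (class against hwy) rather than bechdel, and the jitter width and seed are arbitrary choices.

```r
library(tidyverse)

jittered <- mpg %>%
  ggplot() +
  aes(x = class, y = hwy) +
  # Hide the boxplot's own outlier points so they are not drawn twice
  geom_boxplot(outlier.shape = NA) +
  # Spread points horizontally only; height = 0 leaves the y values untouched
  geom_point(
    position = position_jitter(width = 0.2, height = 0, seed = 1),
    alpha = 0.4
  )

jittered
```

Without the position argument all points in a class would sit on one vertical line and overlap heavily.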

Position things to resolve overlapping (II)

The position argument can also be used to create three different types of bar chart:

  • “stack” creates a stacked bar chart

  • “fill” creates a proportional bar chart

  • “dodge” creates a grouped bar chart

Let’s create all 3 of these for the following dataset:

gss_cat %>% 
  count(relig, marital)
# A tibble: 78 × 3
   relig      marital           n
   <fct>      <fct>         <int>
 1 No answer  No answer         4
 2 No answer  Never married    22
 3 No answer  Separated         3
 4 No answer  Divorced         13
 5 No answer  Widowed           7
 6 No answer  Married          44
 7 Don't know Never married     6
 8 Don't know Separated         3
 9 Don't know Divorced          1
10 Don't know Married           5
# … with 68 more rows
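One way to build the three charts is sketched below. It uses geom_col() because the data is already counted; the object names are made up for this example, and the axis labels will overlap at default sizes, so treat it as a starting point rather than a finished chart.

```r
library(tidyverse)

relig_marital <- gss_cat %>%
  count(relig, marital)

# Shared data and aesthetics; only the position argument changes below
base_chart <- relig_marital %>%
  ggplot() +
  aes(x = relig, y = n, fill = marital)

# Stacked bar chart: segments sit on top of one another
stacked <- base_chart + geom_col(position = "stack")

# Proportional bar chart: every bar is rescaled to a total of 1
proportional <- base_chart + geom_col(position = "fill")

# Grouped bar chart: segments sit side by side
grouped <- base_chart + geom_col(position = "dodge")

stacked
proportional
grouped
```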

Geom layers are placed on top of one another

ggplot(mpg, 
       aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = lm, 
              formula = y ~ splines::bs(x, 3),
              size = 5)

The geom_smooth() line is hiding data points.

We could either swap the order of these geoms or change the alpha aesthetic.
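Both options are sketched below, with made-up object names. One caveat worth knowing: in geom_smooth() the alpha setting affects the confidence ribbon rather than the line itself, so this sketch draws the fitted line with geom_line(stat = "smooth") when it needs a see-through line. It also uses linewidth, which assumes {ggplot2} 3.4.0 or later.

```r
library(tidyverse)

# Option 1: add the points last so they are drawn on top of the line
points_on_top <- ggplot(mpg, aes(displ, hwy)) +
  geom_smooth(method = "lm",
              formula = y ~ splines::bs(x, 3),
              linewidth = 5) +
  geom_point()

# Option 2: keep the original order but make the line see-through
see_through_line <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_line(stat = "smooth",
            method = "lm",
            formula = y ~ splines::bs(x, 3),
            linewidth = 5,
            alpha = 0.4)

points_on_top
see_through_line
```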

Scales

Scales determine the appearance of an aesthetic within the chart, including:

  • Which colours are used for each value

  • Which order the values appear in the chart and guides


msleep %>% 
  ggplot() +
  aes(
    x = sleep_total,
    y = sleep_rem,
    colour = vore
  ) +
  geom_point() +
  scale_colour_manual(
    values = c("carni" = "#c03728", 
               "omni" = "#fd8f24", 
               "insecti" = "#f5c04a", 
               "herbi" = "#919c4c"),
    # Missing values are not matched by name, so they get their colour from na.value
    na.value = "#e68c7c"
  )



DRAGONS:

  • Some rules to follow:

    • Don’t use pies for more than a few groups (ideally 2)

    • If using bubble charts vary by area instead of radius

    • Colour schemes matter A LOT
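For the bubble chart rule, a small sketch with {ggplot2} could use scale_size_area(), which makes the area of each point proportional to the value and maps zero to a point of size zero. The cities data below is entirely made up for illustration, as are the object names.

```r
library(tidyverse)

# Hypothetical data: the values are invented purely for this example
cities <- tribble(
  ~city, ~x, ~y, ~population,
  "A", 1, 1, 100,
  "B", 2, 2, 400,
  "C", 3, 1, 900
)

bubble_chart <- cities %>%
  ggplot() +
  aes(x = x, y = y, size = population) +
  geom_point() +
  # Area (not radius) is proportional to population
  scale_size_area(max_size = 20)

bubble_chart
```

A city with four times the population then gets a bubble with four times the area, which is twice the radius, rather than a bubble that looks sixteen times bigger.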

XXX DRAGONS XXX

  • Dataviz are demonstrably awesome

  • Visual Perception Theory

  • Let’s make some actual ggplot2 charts / Grammar of Graphics

  • Deciding on charts (FT Visual Vocabulary)

  • Things to avoid / Advice (does this belong in visual perception theory?)

    • Starting at zero

    • Lots of different types of charts

    • Dynamite charts

  • Sucking quote

“There is no way of knowing nothing about a subject to knowing something about a subject without going through a period of much frustration and suckiness.” “Push through. You’ll suck less.” Hadley Wickham, author of ggplot2

  • Tables

    • Better than just text
    • But can be overwhelming
      • See doi.org/10.1017/S1537592707072209 for excellent examples
    • Using sparklines can really help

Workshop

Factors.

References

1. Eberhard, K. The effects of visualization on judgment and decision-making: A systematic literature review. Management Review Quarterly (2021) doi:10.1007/s11301-021-00235-8.
2. Anscombe, F. J. Graphs in Statistical Analysis. The American Statistician 27, 17–21 (1973).
3. Matejka, J. & Fitzmaurice, G. Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing. in Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems 1290–1294 (Association for Computing Machinery, 2017). doi:10.1145/3025453.3025912.
4. Cairo, A. Download the Datasaurus: Never trust summary statistics alone; always visualize your data. (2016).
5. Snow, J. On the mode of communication of cholera. (John Churchill, 1855).
6. Nightingale, F. Notes on Matters Affecting the Health, Efficiency and Hospital Administration of the British Army. (Harrison & Sons, 1858).
7. Rosling, H. The best stats you’ve ever seen [Video]. TED (2006).
8. Hawkins, E. Spiralling global temperatures | Climate Lab Book. (2016).
9. Schloss, P. Recreating animated climate temperature spirals in R with ggplot2 and gganimate (CC219). (2022).
10. Kosara, R. & Skau, D. Judgment Error in Pie Chart Variations. EuroVis 2016 - Short Papers 5 pages (2016) doi:10.2312/EUROVISSHORT.20161167.
11. Cleveland, W. S. & McGill, R. Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods. Journal of the American Statistical Association 79, 531–554 (1984).
12. Heer, J. & Bostock, M. Crowdsourcing graphical perception: Using mechanical turk to assess visualization design. in Proceedings of the 28th international conference on Human factors in computing systems - CHI ’10 203 (ACM Press, 2010). doi:10.1145/1753326.1753357.
13. Beecham, R., Dykes, J., Hama, L. & Lomax, N. On the Use of Glyphmaps for Analysing the Scale and Temporal Spread of COVID-19 Reported Cases. ISPRS International Journal of Geo-Information 10, 213 (2021).
14. Kosara, R. More Than Meets the Eye: A Closer Look at Encodings in Visualization. IEEE Computer Graphics and Applications 42, 110–114 (2022).
15. Rivas-González, I. [@irg_bio]. I am also joining the hexbin fever! 🐝 For this week’s #TidyTuesday, I plotted the number of bee colonies in the US by year and season. It seems like cold and warm states have different patterns of seasonal changes. Code: https://github.com/rivasiker/TidyTuesday/blob/main/2022/2022-01-11/analysis_2022-01-11.Rmd #RStats #DataViz #ggplot2 https://t.co/OYGyg2az7M. Twitter (2022).
16. Rivas-González, I. Seasonality in bee colonies with hexbin geofacets. (2022).
17. Burn-Murdoch, J. ggplot2 as a Creativity Engine. in EARL 2016 (2016).